PCI-DMA/CPU Handoff for Increased Effectiveness of Checkpointing Functionalities in CCL
Authors
Abstract
Checkpointing and Communication Library (CCL) is a recently developed software layer that supports optimistic parallel discrete event simulation on Myrinet clusters. Beyond low-latency message delivery, CCL offers non-blocking checkpointing supported by a programmable PCI DMA engine on board Myrinet cards. CCL uses a re-synchronization functionality between PCI DMA activities and CPU activities to keep checkpointed information consistent (i.e., to prevent the CPU from updating information that still needs to be copied through DMA). If re-synchronization is invoked before the checkpoint operation has completed, simulation activities carried out by the CPU may be forced to wait for checkpoint completion. Since data copy through the PCI DMA is slower than what is achievable with the CPU, in pathological situations a re-synchronization period may last longer than a whole checkpoint operation performed by the CPU, thus nullifying the potential benefit of offloading checkpointing from the CPU. This paper tackles this issue by presenting the design and implementation of a handoff mechanism for checkpoint operations between PCI DMA and CPU, which enhances the effectiveness of the checkpointing functionalities offered by CCL. A checkpoint operation is initially entrusted to the PCI DMA; whenever re-synchronization forces the simulation application to wait for its completion, the operation is dynamically switched to the CPU, namely the fastest available device, since its timely completion has become a performance-critical task for the simulation application.
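The handoff idea described in the abstract can be illustrated with a minimal, single-threaded C sketch. This is not CCL's actual API: the names `ckpt_t`, `ckpt_start`, `dma_step`, and `ckpt_resync` are hypothetical, and DMA progress is modeled as explicit chunked copies rather than a real PCI DMA engine running concurrently. The sketch only shows the decision point: at re-synchronization, any part of the checkpoint not yet copied is finished by the CPU instead of waiting for the slower device.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor for one in-flight checkpoint (not CCL's real API). */
typedef struct {
    const unsigned char *src;  /* live simulation state being saved */
    unsigned char *dst;        /* checkpoint buffer */
    size_t size;               /* total bytes to save */
    size_t copied;             /* bytes saved so far */
    size_t cpu_bytes;          /* bytes saved by the CPU after a handoff */
} ckpt_t;

/* Start a checkpoint; in this model it is initially entrusted to the DMA. */
static void ckpt_start(ckpt_t *c, const void *src, void *dst, size_t size) {
    assert(c != NULL && src != NULL && dst != NULL);
    c->src = src;
    c->dst = dst;
    c->size = size;
    c->copied = 0;
    c->cpu_bytes = 0;
}

/* Model of DMA progress (assumption): the engine copies up to `chunk`
 * more bytes each time it is given the chance to run. */
static void dma_step(ckpt_t *c, size_t chunk) {
    size_t left = c->size - c->copied;
    size_t n = chunk < left ? chunk : left;
    memcpy(c->dst + c->copied, c->src + c->copied, n);
    c->copied += n;
}

/* Re-synchronization with handoff: if the checkpoint is still incomplete,
 * the CPU copies the remainder itself instead of waiting for the DMA.
 * After this returns, the application may safely update the state. */
static void ckpt_resync(ckpt_t *c) {
    size_t left = c->size - c->copied;
    if (left > 0) {
        memcpy(c->dst + c->copied, c->src + c->copied, left);
        c->cpu_bytes = left;
        c->copied = c->size;
    }
}
```

In a typical use of this sketch, the application calls `ckpt_start`, lets the modeled DMA advance with `dma_step` while it processes events, and calls `ckpt_resync` before modifying the checkpointed state. A real implementation would also have to stop or bound the in-flight DMA transfer before the CPU takes over, which this model leaves out.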
Similar resources
Tuning of the Checkpointing and Communication Library for Optimistic Simulation on Myrinet Based NOWs
Recently a Checkpointing and Communication Library (CCL) for optimistic simulation on Myrinet based Network of Workstations (NOWs) has been presented. CCL offloads checkpoint operations from the CPU by charging them to a programmable DMA engine on the Myrinet network card. CCL also includes functionalities for freezing the simulation application on demand, which can be used for data consistency ...
Multiprogrammed non-blocking checkpoints in support of optimistic simulation on myrinet clusters
CCL (Checkpointing and Communication Library) is a software layer in support of optimistic Parallel Discrete Event Simulation (PDES) on myrinet-based COTS clusters. Beyond classical low latency message delivery functionalities, this library implements CPU offloaded, non-blocking (asynchronous) checkpointing functionalities based on data transfer capabilities provided by a programmable DMA engin...
Benefits from Semi-asynchronous Checkpointing for Time Warp Simulations of a Large State Pcs Model
Checkpointing overhead is a major obstacle for the effectiveness of Time Warp parallel discrete event simulators. Semi-asynchronous checkpointing is a recent solution to tackle this obstacle for Time Warp simulations on distributed memory systems based on Myrinet. In this solution, checkpoint operations are offloaded from the host CPU and are charged to a DMA engine on board of Myrinet network ...
A Study of Disk Performance Optimization
A STUDY OF DISK PERFORMANCE OPTIMIZATION by Richard S. Gray Response time is one of the most important performance measures associated with a typical multi-user system. Response time, in turn, is bounded by the performance of the input/output (I/O) subsystem. Other than the end user and some external peripherals, the slowest component of the I/O subsystem is the disk drive. One standard strateg...
Performance and Effectiveness Analysis of Checkpointing in Mobile Environments
Many mathematical models have been proposed to evaluate the execution performance of an application with and without checkpointing in the presence of failures. They assume that the total program execution time without failure is known in advance, under which condition the optimal checkpointing interval can be determined. In mobile environments, application components are distributed and tasks a...